Abstract


Latent Dirichlet Allocation (LDA), which mines the thematic structure of documents, plays an important role in natural language processing and machine learning. However, the probability distribution from LDA only describes the statistical relationship of occurrences in the corpus, and in practice probability is usually not the best choice for feature representations. Recently, embedding methods such as Word2Vec and Doc2Vec have been proposed to represent words and documents by learning essential concepts and representations. The embedded representations have proven more effective than LDA-style representations in many tasks. In this paper, we propose the Topic2Vec approach, which learns topic representations in the same semantic vector space as words, as an alternative to probability. The experimental results show that Topic2Vec achieves interesting and meaningful results.


1 Introduction 


Modeling text (words, topics and documents) is a key problem in natural language processing (NLP) and information retrieval (IR). The goal is to find short and essential descriptions which enable efficient processing of large systems and benefit basic tasks such as classification, clustering, summarization and estimation of similarity or relevance.

During the past decades, various models and solutions have been proposed, such as Bag-of-Words (BOW) (Harris, 1954), TF-IDF (Salton and McGill, 1983), Latent Semantic Analysis (LSA) (Landauer et al., 1998) and Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999). But the best-known model is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), which describes the hierarchical relationships between words, topics and documents. In LDA, documents are represented as probability distributions over latent topics, where each topic is characterized by a distribution over words. However, the probability distribution generated from LDA tends to describe the statistical relationship of occurrences rather than the real semantic information embedded in words, topics and documents. LDA also assigns high probabilities to high-frequency words, so words with low probabilities are hard to choose as representatives of topics. But in practice, low-probability words sometimes distinguish topics better. For example, LDA will assign a higher probability to “food” and choose it as a representative rather than “cheeseburger”, “drug” rather than “aricept”, and “technology” rather than “smartphone”.


Recently, distributed representations with neural probabilistic language models (NPLMs) (Bengio et al., 2003) were proposed to represent words and documents as low-dimensional vectors in one semantic space, and achieved significant results in many NLP and ML tasks (Collobert and Weston, 2008; Mnih and Hinton, 2009; Mikolov et al., 2013a; Mnih and Kavukcuoglu, 2013; Huang et al., 2012; Le and Mikolov, 2014). In particular, Word2Vec, proposed by Mikolov et al. (2013a), can automatically learn concepts and semantic-syntactic relationships between words, such as vec(“Berlin”) - vec(“Germany”) = vec(“Paris”) - vec(“France”). Doc2Vec (Para2Vec), proposed by Le and Mikolov (2014), achieves state-of-the-art performance on sentiment analysis. Naturally, in this paper we ask: what will happen if we embed topics in the semantic vector space?


Following the ideas of previously proposed models for words and documents, we propose the Topic2Vec model, as shown in Fig. 1. Based on Word2Vec, we incorporate topics into the NPLM framework for learning distributed representations of topics in the same semantic space as words. Furthermore, similarity and relevance between words and topics can then be estimated naturally, for example with the cosine function, rather than with probability.

In the experiments, we evaluate two different topic representations, the embedding of Topic2Vec and the probability of LDA, in two aspects: listed examples and the t-SNE 2D embedding of the nearest words for each topic. The experimental results show that Topic2Vec achieves distinctive and meaningful results compared to LDA.


2 Related Models 

2.1 Latent Dirichlet Allocation 


Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is a probabilistic generative model that assumes each document is a mixture of latent topics, where each topic is a probability distribution over all words in the vocabulary. Briefly, LDA generates a sequence of words as follows:


• For each of the $N$ words $w_n$ in document $d$:

  - Sample a topic $z_n \sim \mathrm{Multinomial}(\theta_d)$

  - Sample a word $w_n \sim \mathrm{Multinomial}(\phi_{z_n})$.

Using Gibbs sampling for estimation, we obtain the document-topic probability matrix $\Theta$ and the topic-word probability matrix $\Phi$. For a new document of arbitrary length, we can infer its latent topics and, at the same time, assign a topic label to each word in the document.
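The generative process above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the function name `generate_document`, the toy values of `theta_d` and `phi`, and the use of integer word ids are all hypothetical and not taken from the paper.

```python
import random

def generate_document(theta_d, phi, n_words, rng):
    """Sample (topic, word) pairs for one document.

    theta_d: list of K topic probabilities for document d.
    phi: K lists, each a probability distribution over V word ids.
    """
    topic_ids = range(len(theta_d))
    pairs = []
    for _ in range(n_words):
        # z_n ~ Multinomial(theta_d)
        z = rng.choices(topic_ids, weights=theta_d)[0]
        # w_n ~ Multinomial(phi_{z_n})
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]
        pairs.append((z, w))
    return pairs

rng = random.Random(0)
theta_d = [0.7, 0.2, 0.1]            # toy topic mixture (assumed)
phi = [
    [0.5, 0.5, 0.0, 0.0],            # topic 0 only emits words 0 and 1
    [0.0, 0.0, 0.5, 0.5],            # topic 1 only emits words 2 and 3
    [0.25, 0.25, 0.25, 0.25],        # topic 2 is uniform
]
doc = generate_document(theta_d, phi, n_words=20, rng=rng)
```

Each sampled pair keeps the topic label next to its word, which is the same word-topic pairing that Topic2Vec later consumes.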


2.2 Word2Vec 


Inspired by the Neural Probabilistic Language Model (NPLM) (Bengio et al., 2003), Mikolov et al. (2013a) proposed Word2Vec, including CBOW and Skip-gram, for computing continuous vector representations of words from large data sets.

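When training, given a word sequence $D = \{w_1, \ldots, w_M\}$, the learning objective functions are defined to maximize the following log-likelihoods, based on CBOW and Skip-gram, respectively.

$$\mathcal{L}_{CBOW}(D) = \frac{1}{M} \sum_{i=1}^{M} \log p(w_i \mid w_{cxt}) \tag{1a}$$

$$\mathcal{L}_{Skip\text{-}gram}(D) = \frac{1}{M} \sum_{i=1}^{M} \sum_{-k \le c \le k,\, c \ne 0} \log p(w_{i+c} \mid w_i) \tag{1b}$$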
Figure 1: Learning architectures of Topic2Vec, where $(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2})$ are context words and $w_t$ is the current word paired with a topic $z_t$.


Here, in Equation (1a), $w_{cxt}$ indicates the context of the current word $w_i$. In Equation (1b), $k$ is the window size of the context. For any words $w_j$ and $w_i$, the conditional probability $p(w_j \mid w_i)$ is calculated using the softmax function as follows.

$$p(w_j \mid w_i) = \frac{\exp(\mathbf{w}_j \cdot \mathbf{w}_i)}{\sum_{w \in W} \exp(\mathbf{w} \cdot \mathbf{w}_i)}, \tag{2}$$

where $\mathbf{w}$, $\mathbf{w}_i$ and $\mathbf{w}_j$ are respectively the vector representations of the words $w$, $w_i$ and $w_j$, and $W$ is the word vocabulary.
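As a small numerical illustration of Equation (2), the sketch below computes the softmax over dot products for a toy vocabulary. It follows the equation literally, with one vector per word; the function name `softmax_prob` and the three 2D vectors are assumptions made up for the example.

```python
import math

def softmax_prob(j, i, vectors):
    """Return p(w_j | w_i) as in Equation (2): a softmax over dot products."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(w, vectors[i]) for w in vectors]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[j] / sum(exps)

# Toy vocabulary of three words with 2D representations (assumed values).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = [softmax_prob(j, 0, vectors) for j in range(len(vectors))]
```

The denominator sums over the whole vocabulary, which is why the full softmax becomes expensive for large vocabularies and approximations are used during training.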


3 Topic2Vec 

Inspired by Word2Vec, we incorporate topics and words into the NPLM. We propose Topic2Vec, as shown in Fig. 1, for learning distributed topic representations together with word representations. Like Word2Vec, Topic2Vec comes in CBOW and Skip-gram variants. For instance, consider a word sequence $(w_{t-2}, w_{t-1}, w_t, w_{t+1}, w_{t+2})$ in which $w_t$ is the current word, assigned topic $z_t$ by LDA. CBOW predicts the word $w_t$ and the topic $z_t$ based on the surrounding words $(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2})$, while Skip-gram predicts the surrounding words $(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2})$ given the current $w_t$ and $z_t$.
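To make the Skip-gram variant concrete, the following sketch enumerates the (input, context word) training pairs it implies: each context word is predicted once from the current word and once from the current topic. The function name `skipgram_pairs`, the tagged-tuple encoding of inputs and the toy sequence are illustrative assumptions, not the authors' implementation.

```python
def skipgram_pairs(words, topics, k=2):
    """Enumerate Skip-gram training pairs for a word-topic sequence.

    words:  list of tokens w_1..w_M
    topics: list of topic ids z_1..z_M (one per token, e.g. from LDA)
    k:      context window size on each side
    """
    pairs = []
    for i, (w, z) in enumerate(zip(words, topics)):
        lo, hi = max(0, i - k), min(len(words), i + k + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            pairs.append((("word", w), words[j]))    # predict context from w_i
            pairs.append((("topic", z), words[j]))   # predict context from z_i
    return pairs

pairs = skipgram_pairs(["a", "b", "c"], [0, 1, 0], k=1)
```

With a window of `k=1` the three-token toy sequence yields eight pairs, four driven by words and four driven by topics.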

When training, given a word-topic sequence of a document $D = \{w_1 : z_1, \ldots, w_M : z_M\}$, where $z_i$ is the topic of word $w_i$ inferred from LDA, the learning objective functions can be defined to maximize the following log-likelihoods, based on CBOW and Skip-gram, respectively.
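One way to write these two objectives, consistent with the prose description of the CBOW and Skip-gram variants, is sketched below. The exact form is an assumption inferred from that description: $w_{cxt}$ denotes the context of the current word $w_i$, $k$ is the context window size, and $M$ is the sequence length.

```latex
% Assumed form: CBOW predicts both the word and its topic from the context.
\mathcal{L}_{CBOW}(D) = \frac{1}{M} \sum_{i=1}^{M}
    \bigl( \log p(w_i \mid w_{cxt}) + \log p(z_i \mid w_{cxt}) \bigr)

% Assumed form: Skip-gram predicts each context word from the word and from the topic.
\mathcal{L}_{Skip\text{-}gram}(D) = \frac{1}{M} \sum_{i=1}^{M}
    \sum_{-k \le c \le k,\, c \ne 0}
    \bigl( \log p(w_{i+c} \mid w_i) + \log p(w_{i+c} \mid z_i) \bigr)
```

Under this reading, dropping the topic terms recovers the usual Word2Vec objectives, so topics simply contribute an extra prediction term per position.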
Figure 2: Nearest words and topics for each topic. Words are listed with conditional probabilities in LDA, while words and topics are listed with calculated cosine similarity in Topic2Vec.
Topic2Vec aims at learning topic representations along with word representations. For simplicity and efficiency, we follow the optimization scheme used in Word2Vec (Mikolov et al., 2013a). To approximately maximize the probability of the softmax, we use Negative Sampling without Hierarchical Softmax (Mikolov et al., 2013b). Stochastic gradient descent (SGD) and the back-propagation algorithm are used to optimize our model. The complexity of Topic2Vec is linear in the size of the dataset, the same as Word2Vec.
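A single negative-sampling SGD step can be sketched as follows. The loss is the standard negative-sampling objective, $-\log \sigma(u_t \cdot v) - \sum_n \log \sigma(-u_n \cdot v)$, where $v$ is the input vector (a word or a topic embedding) and $u$ are output word vectors. The function name, learning rate, vector dimension and the fixed choice of negatives are assumptions for illustration only.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_sampling_step(v_in, out_vecs, target, negatives, lr=0.1):
    """One SGD step on the negative-sampling loss; updates vectors in place.

    v_in:      input vector (word or topic embedding)
    out_vecs:  list of output word vectors
    target:    index of the observed context word
    negatives: indices of sampled noise words
    Returns the loss measured before the update.
    """
    grad_in = [0.0] * len(v_in)
    loss = 0.0
    for idx, label in [(target, 1.0)] + [(n, 0.0) for n in negatives]:
        u = out_vecs[idx]
        score = sigmoid(sum(a * b for a, b in zip(u, v_in)))
        loss -= math.log(score if label == 1.0 else 1.0 - score)
        g = score - label                  # gradient of the loss w.r.t. u . v
        for d in range(len(v_in)):
            grad_in[d] += g * u[d]         # accumulate gradient for the input vector
            u[d] -= lr * g * v_in[d]       # update the output vector
    for d in range(len(v_in)):
        v_in[d] -= lr * grad_in[d]
    return loss

rng = random.Random(0)
dim = 4
out_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(5)]
topic_vec = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
losses = [neg_sampling_step(topic_vec, out_vecs, target=0, negatives=[1, 2])
          for _ in range(50)]
```

Each step touches only the target and the sampled negatives rather than the whole vocabulary, which is what keeps the per-token cost constant and the overall training cost linear in the corpus size.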


4 Experiments 

4.1 Dataset 

We use the English Gigaword Fifth Edition¹ as our training data for learning fundamental word and topic representations. We randomly extract part of the documents and construct our training set as follows: we choose 100,000 documents, each consisting of more than 1,000 characters, from the subfolder ltw_eng (Los Angeles Times), which contains 411,032 documents. We also eliminate stop words and words that occur fewer than 5 times. In the end, the training set contains about 42 million words and the vocabulary size is 102,644.
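The filtering rules above could be implemented roughly as follows. This is a minimal sketch: whitespace tokenization, lowercasing, the function name `build_training_set` and the tiny stop-word set are assumptions, since the paper does not specify these details.

```python
from collections import Counter

def build_training_set(documents, stop_words, min_chars=1000, min_count=5):
    """Keep long documents, then drop stop words and rare words."""
    # Keep documents with more than `min_chars` characters.
    kept = [doc for doc in documents if len(doc) > min_chars]
    # Assumed tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in kept]
    counts = Counter(tok for doc in tokenized for tok in doc)
    # Keep words seen at least `min_count` times that are not stop words.
    vocab = {w for w, c in counts.items()
             if c >= min_count and w not in stop_words}
    filtered = [[tok for tok in doc if tok in vocab] for doc in tokenized]
    return filtered, vocab

# Toy usage with a lowered character threshold so the example stays small.
docs = ["the cat sat " * 5, "cat dog"]
filtered, vocab = build_training_set(docs, stop_words={"the"}, min_chars=10)
```

In the toy run the second document is dropped for being too short, and "the" is removed as a stop word even though it is frequent.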

4.2 Evaluation Methods 

In the experiments, we run Topic2Vec in Skip-gram mode and learn topic representations together with word representations. We then evaluate the topic representations by comparing Topic2Vec with LDA in two aspects: (1) we select the most related topics or words conditioned on selected topics, and (2) we embed these related words or topics in 2D space using t-SNE (Maaten et al., 2008). During the process, we cluster words into topics as follows:


• LDA: each topic is a probability distribution over words. We select the top N = 10 words with the highest conditional probability.

• Topic2Vec: topics and words are both represented as low-dimensional vectors, so we can directly calculate the cosine similarity between words and topics. For each topic, we select the words with the highest similarity.

¹ https://catalog.ldc.upenn.edu/LDC2011T07

Figure 3: t-SNE 2D embedding of the nearest word representation for each topic in LDA (left) and Topic2Vec (right).
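The Topic2Vec selection step, ranking words by cosine similarity to a topic vector, can be sketched as below. The function names and the 2D vectors are invented for illustration; the words are borrowed from the paper's examples but the numbers are not real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_words(topic_vec, word_vecs, top_n=10):
    """Return the top_n (word, similarity) pairs closest to topic_vec."""
    scored = [(word, cosine(topic_vec, vec)) for word, vec in word_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy vectors (assumed values, not learned embeddings).
word_vecs = {
    "aricept": [0.9, 0.1],
    "hospital": [0.1, 0.9],
    "drug": [0.7, 0.3],
}
topic_vec = [1.0, 0.0]
top = nearest_words(topic_vec, word_vecs, top_n=2)
```

Because topics and words share one vector space, the same function also ranks topics against a topic or words against a word without any change.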


4.3 Analysis of Results

Fig. 2 shows the top 10 nearest words from LDA and Topic2Vec for eight typical selected topics. We now give a more detailed analysis to understand the difference between them. As shown in Fig. 2, in Topic_19, LDA returns words like “drug”, “drugs”, “cancer” and “patients”, while Topic2Vec returns “aricept”, “memantine”, “enbrel” and “gabapentin”. In Topic_27, LDA returns the words “medical”, “hospital”, “care”, “patients” and “doctors”, while Topic2Vec returns “neonatal”, “anesthesiologists”, “anesthesia” and “comatose”. From the LDA results, we only know that Topic_19 and Topic_27 share the same theme of “patients” or “medical”; we cannot tell how they differ. From the Topic2Vec results, however, we can easily see that Topic_19 focuses on a more specific topic about drugs (“aricept”, “memantine”, “enbrel” and “gabapentin”), while Topic_27 focuses on another specific topic about treatment (“anesthesiologists”, “anesthesia” and “comatose”); the two are clearly different. Topic2Vec thus presents more distinguishable results between two similar topics.

Fig. 3 shows the 2D embedding of the corresponding related words for each topic using t-SNE. Topic2Vec produces a better grouping and separation of the words in different topics. In contrast, LDA does not produce a well-separated embedding, and words in different topics tend to mix together.

In summary, for each topic, the words selected by Topic2Vec are more typical and representative than those returned by LDA. Consequently, Topic2Vec can better distinguish different topics.


5 Conclusions and Future Work 

In this paper, by integrating NPLM, Word2Vec and LDA, we propose Topic2Vec, which successfully embeds latent topics in the same semantic vector space as words. Our purpose is to learn a new style of topic representation with Topic2Vec. From the experimental observations, Topic2Vec presents more distinguishable results than LDA, and we conclude that Topic2Vec can model topics better.

At present, we have only evaluated Topic2Vec and LDA qualitatively; in the future we will analyze their differences quantitatively and in more detail. Besides, we currently have to run LDA first to assign a topic to each word in the corpus before running Topic2Vec. We will also explore new independent topic models that can both mine the thematic structure of documents as LDA does and learn inherent representations and model topics better as Topic2Vec does.